{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 201 - Engineering Text Features Using `mmlspark` Modules and Spark SQL\n",
    "\n",
    "Again, try to predict Amazon book ratings greater than 3 out of 5, this time using\n",
    "the `TextFeaturizer` module which is a composition of several text analytics APIs that\n",
    "are native to Spark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import mmlspark\n",
    "from pyspark.sql.types import IntegerType, StringType, StructType, StructField"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataFile = \"BookReviewsFromAmazon10K.tsv\"\n",
    "textSchema = StructType([StructField(\"rating\", IntegerType(), False),\n",
    "                         StructField(\"text\", StringType(), False)])\n",
    "import os, urllib\n",
    "if not os.path.isfile(dataFile):\n",
    "    urllib.request.urlretrieve(\"https://mmlspark.azureedge.net/datasets/\"+dataFile, dataFile)\n",
    "data = spark.createDataFrame(pd.read_csv(dataFile, sep=\"\\t\", header=None), textSchema)\n",
    "data.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `TextFeaturizer` to generate our features column.  We remove stop words, and use TF-IDF\n",
    "to generate 2\u00b2\u2070 sparse features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark.TextFeaturizer import TextFeaturizer\n",
    "textFeaturizer = TextFeaturizer() \\\n",
    "  .setInputCol(\"text\").setOutputCol(\"features\") \\\n",
    "  .setUseStopWordsRemover(True).setUseIDF(True).setMinDocFreq(5).setNumFeatures(1 << 16).fit(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processedData = textFeaturizer.transform(data)\n",
    "processedData.limit(5).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the label so that we can predict whether the rating is greater than 3 using a binary\n",
    "classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "processedData = processedData.withColumn(\"label\", processedData[\"rating\"] > 3) \\\n",
    "                             .select([\"features\", \"label\"])\n",
    "processedData.limit(5).toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train several Logistic Regression models with different regularizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test, validation = processedData.randomSplit([0.60, 0.20, 0.20])\n",
    "from pyspark.ml.classification import LogisticRegression\n",
    "\n",
    "lrHyperParams = [0.05, 0.1, 0.2, 0.4]\n",
    "logisticRegressions = [LogisticRegression(regParam = hyperParam) for hyperParam in lrHyperParams]\n",
    "\n",
    "from mmlspark.TrainClassifier import TrainClassifier\n",
    "lrmodels = [TrainClassifier(model=lrm, labelCol=\"label\").fit(train) for lrm in logisticRegressions]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the model with the best AUC on the test set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark import FindBestModel, BestModel\n",
    "bestModel = FindBestModel(evaluationMetric=\"AUC\", models=lrmodels).fit(test)\n",
    "bestModel.getEvaluationResults().show()\n",
    "bestModel.getBestModelMetrics().show()\n",
    "bestModel.getAllModelMetrics().show()\n",
    "bestModel.write().overwrite().save(\"model.mml\")\n",
    "loadedBestModel = BestModel.load(\"model.mml\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the optimized `ComputeModelStatistics` API to find the model accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mmlspark.ComputeModelStatistics import ComputeModelStatistics\n",
    "predictions = loadedBestModel.transform(validation)\n",
    "metrics = ComputeModelStatistics().transform(predictions)\n",
    "print(\"Best model's accuracy on validation set = \"\n",
    "      + \"{0:.2f}%\".format(metrics.first()[\"accuracy\"] * 100))"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
